Libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(usmap)
library(maps)
library(ggplot2)
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble  2.1.3     v purrr   0.3.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::map()    masks maps::map()
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
library(viridis)
## Loading required package: viridisLite
library(rgdal)
## Loading required package: sp
## rgdal: version: 1.4-8, (SVN revision 845)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.2.3, released 2017/11/20
##  Path to GDAL shared files: C:/Users/16196/OneDrive/Documents/R/win-library/3.6/rgdal/gdal
##  GDAL binary built with GEOS: TRUE 
##  Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
##  Path to PROJ.4 shared files: C:/Users/16196/OneDrive/Documents/R/win-library/3.6/rgdal/proj
##  Linking to sp version: 1.4-1
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggmap':
## 
##     wind
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
require(janitor)
## Loading required package: janitor
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
require(reshape2)
## Loading required package: reshape2
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
require(tidyr)
require(rstan)
## Loading required package: rstan
## Loading required package: StanHeaders
## rstan (Version 2.19.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## For improved execution time, we recommend calling
## Sys.setenv(LOCAL_CPPFLAGS = '-march=native')
## although this causes Stan to throw an error on a few processors.
## 
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
## 
##     extract
require(rstanarm)
## Loading required package: rstanarm
## Loading required package: Rcpp
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
## rstanarm (Version 2.19.2, packaged: 2019-10-01 20:20:33 UTC)
## - Do not expect the default priors to remain the same in future rstanarm versions.
## Thus, R scripts should specify priors explicitly, even if they are just the defaults.
## - For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores())
## - bayesplot theme set to bayesplot::theme_default()
##    * Does _not_ affect other ggplot2 plots
##    * See ?bayesplot_theme_set for details on theme setting
## 
## Attaching package: 'rstanarm'
## The following object is masked from 'package:rstan':
## 
##     loo
require(bayesplot)
## Loading required package: bayesplot
## This is bayesplot version 1.7.1
## - Online documentation and vignettes at mc-stan.org/bayesplot
## - bayesplot theme set to bayesplot::theme_default()
##    * Does _not_ affect other ggplot2 plots
##    * See ?bayesplot_theme_set for details on theme setting



Data

Finaldata<- read.csv("time_series_final_Data.csv")
google<-read.csv("Google/googlefinal.csv")
names(google)
## [1] "X.1"                "X"                  "Day"               
## [4] "State"              "ChinaVirusInterest" "KungFluInterest"   
## [7] "Region"

Re-Cleaned

google <- google %>%  select(-c(X.1))
Finaldata <- Finaldata %>% select(-c(X)) 
Finaldata <- Finaldata %>% mutate(Day= as.Date(Day))

Dataset Description

Demographic:

Our dataset, Demographic, was created by merging three datasets that contained different demographic and election information. We obtained the main portion of our demographic data from the American Community Survey (ACS), an ongoing survey administered by the U.S. Census Bureau that gathers information on income, employment, housing characteristics, and more, annually for all 50 U.S. states at the county and state level. To access the county-level dataset we used the R package choroplethr, which provides API connections to data sources like the ACS. The ACS county-level dataset was then merged with a county-level election outcome dataset created by Tony McGovern. His dataset contains presidential election results for 2008, 2012, and 2016, but we chose to focus solely on the most recent election, 2016. The 2016 county-level results were scraped from results published by Townhall.com. However, Alaska reports results at the precinct or state level, so no county-level data were available; a separate dataset containing Alaska's election results was therefore created from the official results provided by the Alaska Division of Elections and merged in later. The final dataset came from Alicia Johnson and contains information on each state's political leaning: it categorizes each county as belonging to a blue, red, or purple state based on the state categorizations at 270towin.
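The merging described above can be sketched with dplyr joins on the shared FIPS code. The table names and toy values below are illustrative only, not the exact names or numbers from our cleaning scripts:

```r
library(dplyr)

# Illustrative county-level tables; the real data come from the ACS,
# Tony McGovern's election results, and the state-leaning file.
acs_county <- tibble(
  fips = c(1001, 1003), per_capita_income = c(24000, 27000)
)
election_county <- tibble(
  fips = c(1001, 1003), gop_2016 = c(18000, 72000)
)
state_leaning <- tibble(
  fips = c(1001, 1003), StateColor = c("red", "red")
)

# Left joins keep every ACS county, attaching election results
# and political leaning where a matching FIPS code exists.
demographic <- acs_county %>%
  left_join(election_county, by = "fips") %>%
  left_join(state_leaning, by = "fips")
```

A left join (rather than an inner join) makes missing counties, like Alaska's before the fix described above, show up as NA rows instead of silently disappearing.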

COVID-19 Cases

The COVID-19 data are provided by The COVID Tracking Project (CTP). All of the data points come from state/district/territory public health authorities or, occasionally, from trusted news reporting, official press conferences, or (very occasionally) tweets or Facebook updates from state public health authorities or governors. These numbers are updated daily at 4 PM EST. The biggest weakness of this dataset is that there is no standardized method that states follow for data collection and reporting. For example, some states, like Oregon, provide the full set of numbers, while others provide some or none of these numbers on an ongoing basis. Some crucial states in this outbreak, notably California, Washington, and New York, have not been regularly reporting their total number of people tested. The CTP aims to remedy this uncertainty by utilizing other reporting and measuring tools: “Directly asking state officials, watching news conferences, gleaning information from trusted news sources, and whatever else it takes to present reliable numbers.”

Google Search Interest

This dataset includes two search interest indexes over time, measuring interest within each state in searching for either “Kung Flu” or “China Virus” over the selected time frame. The data were downloaded directly from Google Trends, which uses the same technique to track interest in all searches on the platform. The main downside of this dataset is the indexing method, which makes state-to-state comparison less meaningful: each state is guaranteed a 100-level interest on its peak day, while the actual (unknown) raw search volumes can vary greatly across states.
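To make the indexing concrete, here is a minimal sketch of a Google-Trends-style normalization applied to hypothetical raw daily search counts for one state (the counts are made up; Google does not release raw volumes):

```r
# Hypothetical raw daily search counts for one state
raw_counts <- c(40, 55, 220, 90, 10)

# Trends-style index: peak day = 100, all other days scaled to it
interest_index <- round(100 * raw_counts / max(raw_counts))
interest_index
## [1]  18  25 100  41   5
```

Because every state's series is rescaled to its own peak, an index of 50 in Wyoming and an index of 50 in California can correspond to wildly different raw search volumes, which is exactly why cross-state comparisons of this index are limited.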

Dataset Details

dim(Finaldata)
## [1] 765  31
names(Finaldata)
##  [1] "Day"                    "State"                 
##  [3] "ChinaVirusInterest"     "KungFluInterest"       
##  [5] "positive"               "negative"              
##  [7] "death"                  "hospitalized"          
##  [9] "totalTestResults"       "FIPS"                  
## [11] "StayAtHome_date"        "Quarantine_Yes"        
## [13] "polyname"               "StateColor"            
## [15] "total_2016"             "dem_2016"              
## [17] "gop_2016"               "oth_2016"              
## [19] "total_population"       "percent_white"         
## [21] "percent_black"          "percent_asian"         
## [23] "percent_hispanic"       "per_capita_income"     
## [25] "median_rent"            "median_age"            
## [27] "percent_democrat2016"   "percent_republican2016"
## [29] "percent_other2016"      "Winner"                
## [31] "Region"
head(Finaldata)
##          Day State ChinaVirusInterest KungFluInterest positive negative
## 1 2020-03-10    AK                 27               0        0       23
## 2 2020-03-11    AK                  0               0        0       46
## 3 2020-03-12    AK                 26               0        0       46
## 4 2020-03-13    AK                 28               0        1       59
## 5 2020-03-14    AK                 30               0        1      143
## 6 2020-03-15    AK                 61               0        1      143
##   death hospitalized totalTestResults FIPS StayAtHome_date Quarantine_Yes
## 1    NA           NA               23    2      03/28/2020              1
## 2    NA           NA               46    2      03/28/2020              1
## 3    NA           NA               46    2      03/28/2020              1
## 4    NA           NA               60    2      03/28/2020              1
## 5    NA           NA              144    2      03/28/2020              1
## 6    NA           NA              144    2      03/28/2020              1
##   polyname StateColor total_2016 dem_2016 gop_2016 oth_2016
## 1   alaska        red     318608   116454   163387    38767
## 2   alaska        red     318608   116454   163387    38767
## 3   alaska        red     318608   116454   163387    38767
## 4   alaska        red     318608   116454   163387    38767
## 5   alaska        red     318608   116454   163387    38767
## 6   alaska        red     318608   116454   163387    38767
##   total_population percent_white percent_black percent_asian
## 1           741456          0.65          0.04          0.07
## 2           741456          0.65          0.04          0.07
## 3           741456          0.65          0.04          0.07
## 4           741456          0.65          0.04          0.07
## 5           741456          0.65          0.04          0.07
## 6           741456          0.65          0.04          0.07
##   percent_hispanic per_capita_income median_rent median_age
## 1             0.07             34922          NA       34.7
## 2             0.07             34922          NA       34.7
## 3             0.07             34922          NA       34.7
## 4             0.07             34922          NA       34.7
## 5             0.07             34922          NA       34.7
## 6             0.07             34922          NA       34.7
##   percent_democrat2016 percent_republican2016 percent_other2016     Winner
## 1            0.3655087              0.5128151         0.1216762 Republican
## 2            0.3655087              0.5128151         0.1216762 Republican
## 3            0.3655087              0.5128151         0.1216762 Republican
## 4            0.3655087              0.5128151         0.1216762 Republican
## 5            0.3655087              0.5128151         0.1216762 Republican
## 6            0.3655087              0.5128151         0.1216762 Republican
##   Region
## 1   West
## 2   West
## 3   West
## 4   West
## 5   West
## 6   West
summary(Finaldata)
##       Day                 State     ChinaVirusInterest KungFluInterest 
##  Min.   :2020-03-10   AK     : 15   Min.   :  0.00     Min.   : 0.000  
##  1st Qu.:2020-03-13   AL     : 15   1st Qu.: 26.00     1st Qu.: 0.000  
##  Median :2020-03-17   AR     : 15   Median : 37.00     Median : 0.000  
##  Mean   :2020-03-17   AZ     : 15   Mean   : 37.24     Mean   : 3.903  
##  3rd Qu.:2020-03-21   CA     : 15   3rd Qu.: 49.00     3rd Qu.: 5.000  
##  Max.   :2020-03-24   CO     : 15   Max.   :100.00     Max.   :58.000  
##                       (Other):675                                      
##     positive          negative         death          hospitalized    
##  Min.   :    0.0   Min.   :    0   Min.   :  0.000   Min.   :   0.00  
##  1st Qu.:   10.0   1st Qu.:   94   1st Qu.:  1.000   1st Qu.:   3.75  
##  Median :   38.0   Median :  337   Median :  2.000   Median :  32.50  
##  Mean   :  322.2   Mean   : 1820   Mean   :  8.728   Mean   : 210.91  
##  3rd Qu.:  152.0   3rd Qu.: 1447   3rd Qu.:  5.000   3rd Qu.:  73.75  
##  Max.   :25665.0   Max.   :65605   Max.   :210.000   Max.   :3234.00  
##                    NA's   :41      NA's   :416       NA's   :709      
##  totalTestResults      FIPS         StayAtHome_date Quarantine_Yes
##  Min.   :    0    Min.   : 1.00   03/24/2020: 90    Min.   :0.00  
##  1st Qu.:  103    1st Qu.:16.00   03/28/2020: 75    1st Qu.:1.00  
##  Median :  380    Median :29.00   03/30/2020: 75    Median :1.00  
##  Mean   : 2045    Mean   :28.96   03/23/2020: 60    Mean   :0.88  
##  3rd Qu.: 1500    3rd Qu.:42.00   03/25/2020: 60    3rd Qu.:1.00  
##  Max.   :91270    Max.   :56.00   (Other)   :300    Max.   :1.00  
##                                   NA's      :105    NA's   :15    
##        polyname    StateColor    total_2016         dem_2016      
##  alabama   : 15   blue  :285   Min.   : 248742   Min.   :  55949  
##  alaska    : 15   purple:165   1st Qu.: 730628   1st Qu.: 266827  
##  arizona   : 15   red   :315   Median :1922218   Median : 779535  
##  arkansas  : 15                Mean   :2489256   Mean   :1188393  
##  california: 15                3rd Qu.:3208899   3rd Qu.:1534487  
##  colorado  : 15                Max.   :9631972   Max.   :5931283  
##  (Other)   :675                                                   
##     gop_2016          oth_2016      total_population   percent_white   
##  Min.   :  11553   Min.   :  8496   Min.   :  570134   Min.   :0.2300  
##  1st Qu.: 345598   1st Qu.: 38767   1st Qu.: 1583364   1st Qu.:0.5900  
##  Median : 947934   Median : 91364   Median : 4361333   Median :0.7400  
##  Mean   :1179254   Mean   :121608   Mean   : 6108975   Mean   :0.7029  
##  3rd Qu.:1535513   3rd Qu.:183694   3rd Qu.: 6819579   3rd Qu.:0.8300  
##  Max.   :4681590   Max.   :515968   Max.   :37659181   Max.   :0.9400  
##                                                                        
##  percent_black    percent_asian     percent_hispanic per_capita_income
##  Min.   :0.0000   Min.   :0.01000   Min.   :0.0100   Min.   :20618    
##  1st Qu.:0.0300   1st Qu.:0.01000   1st Qu.:0.0400   1st Qu.:24635    
##  Median :0.0700   Median :0.02000   Median :0.0800   Median :26824    
##  Mean   :0.1084   Mean   :0.03765   Mean   :0.1082   Mean   :28098    
##  3rd Qu.:0.1500   3rd Qu.:0.04000   3rd Qu.:0.1300   3rd Qu.:30469    
##  Max.   :0.4900   Max.   :0.37000   Max.   :0.4700   Max.   :45290    
##                                                                       
##   median_rent       median_age    percent_democrat2016
##  Min.   : 448.0   Min.   :29.60   Min.   :0.2249      
##  1st Qu.: 564.0   1st Qu.:36.30   1st Qu.:0.3609      
##  Median : 658.0   Median :37.60   Median :0.4670      
##  Mean   : 714.3   Mean   :37.66   Mean   :0.4501      
##  3rd Qu.: 838.0   3rd Qu.:39.00   3rd Qu.:0.5335      
##  Max.   :1220.0   Max.   :43.20   Max.   :0.9285      
##  NA's   :15                                           
##  percent_republican2016 percent_other2016        Winner          Region   
##  Min.   :0.04122        Min.   :0.01937   Democrat  :315   Midwest  :195  
##  1st Qu.:0.41161        1st Qu.:0.04160   Republican:450   Mountain :120  
##  Median :0.49064        Median :0.05071                    Northeast:180  
##  Mean   :0.49010        Mean   :0.05976                    South    :195  
##  3rd Qu.:0.58095        3rd Qu.:0.06991                    West     : 75  
##  Max.   :0.70052        Max.   :0.25598                                   
## 

Variables of Interest

Variable                  Description

polyname                  State name
StateColor                Political leaning (blue/red/purple)
percent_hispanic          Percent of the population that is Hispanic
percent_white             Percent of the population that is White
percent_asian             Percent of the population that is Asian
percent_black             Percent of the population that is Black
total_population          Total state population
per_capita_income         Income per capita
percent_democrat2016      Percent of votes won by the Democrat (Clinton)
percent_republican2016    Percent of votes won by the Republican (Trump)
Winner                    Indicator for whether a Republican or Democrat won
total_2016                Total number of votes cast in 2016
positive                  Number of reported positive COVID-19 cases
negative                  Number of reported negative COVID-19 cases
Day                       Date of report
death                     Total number of reported deaths due to COVID-19
hospitalized              Total number of individuals hospitalized due to COVID-19
totalTestResults          Total number of test results (positive + negative)
FIPS                      Federal Information Processing Standards code uniquely identifying each state
KungFluInterest           Google search interest index by state: peak search day = 100, all other days scaled relative to that peak
ChinaVirusInterest        Google search interest index by state, indexed the same way (peak day = 100)
Region                    States grouped into five regions: West, South, Mountain, Northeast, Midwest
StayAtHome_date           The date the state's stay-at-home order took effect
Quarantine_Yes            A 0/1 indicator of whether the state enforced a stay-at-home order



Visualizations

Google Set

Trump<- as.Date("03/16/2020", format = "%m/%d/%Y")
day1<-as.Date("03/10/2020",format = "%m/%d/%Y")
day2<-as.Date("03/24/2020",format = "%m/%d/%Y")

google_ts<-google%>%
  mutate(Day = as.Date(as.character(Day)))

a<-google_ts %>% 
  #filter(Day<=day2)%>%
  #filter(Day>=day1)%>%
  group_by(Region, Day) %>% 
  summarize(ChinaVirusSearch = median(ChinaVirusInterest)) %>% 
  ggplot(aes(x=Day, y=ChinaVirusSearch, color=Region))+
  geom_point()

ggplotly(a)

This plot shows “China Virus” search interest over time, grouped by region. It shows that certain key events trigger an uptick in searches overall. It is not clear from this plot which region searched “China Virus” more or less often, but it does show that the regions move together in search interest, which suggests that national-level events, such as a Donald Trump tweet, trigger these interest spikes.

Demographic

#Demographic: Identification 
## Might not use, how helpful is this? Possibly combine into a shiny app so people could adjust for which race to look at? 
Finaldata <- data.frame(Finaldata) %>% mutate(state = State)
plot_usmap(data = Finaldata, values = "percent_white", color = "white") + 
  scale_fill_continuous(name = "Percent White", label = scales::comma) + 
  theme(legend.position = "right")+ ggtitle("Percent of Residents that Identify as White") + 
  theme(
plot.title = element_text(color="Black", size=14, face="bold")
)

Our first visualization looks at the percent of residents that identify as white across the United States. As you can see, there is a higher percentage of white-identifying residents in the Midwest and Northeast. We can also see that states like Texas, California, and New Mexico have much lower percentages of white-identifying residents, which could provide important information for our actual analysis.

plot_usmap(data = Finaldata, values = "percent_asian", color = "white") + 
  scale_fill_continuous(low = "sky blue", high = "black", name = "Percent Asian", label = scales::comma) + 
  theme(legend.position = "right") + ggtitle("Percent of Residents that Identify as Asian") +
  theme(
plot.title = element_text(color="Black", size=14, face="bold")
) 

The above visualization looks at the percent of residents that identify as Asian across the United States. There is a higher percentage of Asian-identifying residents in places like California and Washington, with the highest in Hawaii. We can also see that the Midwest and the South tend to have a small percentage of residents identifying as Asian. We are especially interested in the percent_asian variable, as it plays a major role in our analysis.

#By income 
plot_usmap(data = Finaldata, values = "per_capita_income", color = "white") + 
  scale_fill_continuous(low = "sky blue", high = "black", name = "$ Per Capita", label = scales::comma) + 
  theme(legend.position = "right") + ggtitle("Income per Capita")+
  theme(
plot.title = element_text(color="Black", size=14, face="bold")
) 

We also wanted to take a look at the per capita income variable, as we believe it could be valuable in our analysis. Specifically, we think it could be an indicator for the number of COVID-19 cases and/or could affect an individual's reaction to Trump's tweet. This visualization, however, is just univariate, as we wanted to understand the variable a bit more first. As you can see, the Northeast and the West Coast tend to have higher incomes per capita than regions like the South.

#Income by state affiliation 
g <- ggplot(Finaldata, aes(x = per_capita_income, fill = StateColor)) + geom_density(alpha = .8)+ ggtitle("Income per Capita") + xlab("$") + ylab("Density") 


g + scale_fill_manual(values=c("blue", "purple", "red"))

After learning more about the per capita income variable, we wanted to see how it changes when we add StateColor into the mix, where StateColor signifies a state's political leaning during the election. What is interesting is that red states had much lower incomes per capita than blue states. Blue states had a much wider and higher spread of income per capita, ranging all the way to $45,000, while purple states tended to fall in the middle.

#Trump Support 
ggplot(Finaldata, aes(x = percent_republican2016)) + 
    geom_density()+ ggtitle("Trump Support During the 2016 Elections") + xlab("Percent of Trump Support") + ylab("Density") + facet_wrap(~Region)

This visualization looks at Trump support during the 2016 election across different U.S. regions. Each region shows fairly varied support. The Northeast has an interesting peak at 0.4 but also a smaller bump near 0. The Western and Mountain regions show a similar pattern, with strong peaks that then dip down fairly drastically. The South and the Midwest lean toward higher Trump support, and we don't see the dipping occur as much. This leads us to believe that in that election those regions had more states that favored Trump and the Republican party.

ggplot(Finaldata, aes(x = percent_asian, y = percent_republican2016, color = Winner)) + 
    scale_color_manual(values = c("blue","purple","red")) + 
    geom_point(alpha = 0.8) + geom_text(aes(label=ifelse(percent_asian>.2,as.character(state),"")),hjust=1.2,vjust=0)+ geom_text(aes(label=ifelse(percent_republican2016<.2,as.character(state),"")),hjust=-.1,vjust=0) + ggtitle("Percent of Asians and Trump Support by Election Outcome") + xlab("Percent Asian") + ylab("Trump Support")

This final demographic visualization depicts Trump support across all states against the percent of residents who identify as Asian, colored by whether Trump won that state. The two labeled outliers are Hawaii and DC. These are interesting: DC had a small percentage of Asian residents and also low Trump support, which is expected, while Hawaii had a much larger percentage of Asian-identifying residents but still a decent amount of Trump support. All in all, this visualization tells us that states with lower percentages of Asian-identifying residents tended to have higher Trump support.

COVID-19 Cases

## Creating a df of just the cases the week before 3/17
weekbefore <- data.frame(Finaldata) %>% filter(Day <= as.Date("2020-03-17"))

plot1<- plot_usmap(data = weekbefore, values = "positive", color = "white") + 
  scale_fill_gradient2(low = "white", high ="blue", mid = "skyblue",midpoint = 30,
                          name = "Cases (#)",label = scales::comma,
                          limits = c(0,500)) + 
  theme(legend.position = "right")+ ggtitle("# of COVID Cases Before 3/17") + 
  theme(
plot.title = element_text(color="Black", size=8)
)

## Creating a df of just the cases the week after 3/17
weekafter <- data.frame(Finaldata) %>% filter(Day >= as.Date("2020-03-17"))

plot2<-plot_usmap(data = weekafter, values = "positive", color = "white") + 
   scale_fill_gradient2(low = "white", high ="red", mid = "blue",midpoint = 500,
                          name = "Cases (#)",label = scales::comma,
                          limits = c(0,10000)) + 
  theme(legend.position = "right")+ ggtitle("# of COVID Cases After 3/17") + 
  theme(
plot.title = element_text(color="Black", size=5)
)

# The number of Positive Covid Cases, comparison before and after the China Virus announcement 
grid.arrange(plot1, plot2, ncol=2)

This visualization compares the raw count of positive COVID-19 cases across the US before and after March 17th (when Trump said “China Virus”). The goal is to show the number of positive COVID-19 cases as time progresses. From the first plot we can see that very few states have positive case counts above the 0-100 range, though states like Washington, Michigan, California, and New York have higher numbers. The second plot shows that, while very few states are above ~2,000 cases, there is overall a higher number of positive cases across the US. This intuitively makes sense: this is an infectious disease, so we would expect more positive cases over time.

Final Dataset

b<- ggplot(Finaldata, aes(x = ChinaVirusInterest, fill = as.factor(State))) + 
  geom_density(alpha = 0.5)

ggplotly(b)

We can see that Google interest in the term “China Virus” has quite a large range between states. Very few states have high densities among the upper echelons of the interest scale, but there are some interesting density peaks among the lower values. For example, Alaska, Wyoming, and Iowa have unusual peaks around the 25-50 range. It is also interesting to note that there isn't an obvious mean or median value of “China Virus” interest among the states.

c<- ggplot(Finaldata, aes(x = Day, y = ChinaVirusInterest, color = factor(State))) + 
    geom_point() + 
    geom_smooth(method = "lm", se = FALSE) + 
    facet_wrap(~ State) + 
    theme(legend.position = "none")

ggplotly(c)

With this visualization we wanted to observe each state's slope of interest in the term “China Virus” over time. Arguably, the slopes of interest among the states are not significantly different. There are some interesting downward slopes in states like Wyoming, but we would need to dig a bit deeper to fully understand what that means. However, a really interesting facet of this visualization is the pattern of peaks and valleys shared by all states. In other words, from observation, the majority of states demonstrate a pattern correlated with the day. This would be interesting to explore further, to see how it interacts with other aspects like the number of COVID-19 cases by state.



Building Model

\[\begin{aligned} i = \text{region, } j = \text{date, } Y = \text{China Virus Interest}\\ Y_{ij}|b_{0i},b_{1i} \sim N(b_{0i} + b_{1i}Day_{ij},\sigma^2)\\ b_{0i} \sim N(...)\\ b_{1i} \sim N(...)\\ \end{aligned}\]
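In rstanarm syntax, this specification, with region-specific intercepts and Day slopes, would correspond to a formula along these lines (a sketch only, not run here; the chunk below starts with a simpler varying-intercepts version):

```r
# Sketch (not run): varying intercepts and Day slopes by Region
model_full <- stan_glmer(
  ChinaVirusInterest ~ Day + (Day | Region),
  data = Finaldata, family = gaussian, chains = 4, iter = 5000*2
)
```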

set.seed(454)
Finaldata<-
  Finaldata%>%
  mutate(Day=as.numeric(Day))


model_1 <- stan_glmer(
  ChinaVirusInterest ~ (1 | Region),
  data = Finaldata, family = gaussian, chains = 4, iter = 5000*2
)
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 10000 [  0%]  (Warmup)
## Chain 1: Iteration: 1000 / 10000 [ 10%]  (Warmup)
## Chain 1: Iteration: 2000 / 10000 [ 20%]  (Warmup)
## Chain 1: Iteration: 3000 / 10000 [ 30%]  (Warmup)
## Chain 1: Iteration: 4000 / 10000 [ 40%]  (Warmup)
## Chain 1: Iteration: 5000 / 10000 [ 50%]  (Warmup)
## Chain 1: Iteration: 5001 / 10000 [ 50%]  (Sampling)
## Chain 1: Iteration: 6000 / 10000 [ 60%]  (Sampling)
## Chain 1: Iteration: 7000 / 10000 [ 70%]  (Sampling)
## Chain 1: Iteration: 8000 / 10000 [ 80%]  (Sampling)
## Chain 1: Iteration: 9000 / 10000 [ 90%]  (Sampling)
## Chain 1: Iteration: 10000 / 10000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 3.983 seconds (Warm-up)
## Chain 1:                4.071 seconds (Sampling)
## Chain 1:                8.054 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 10000 [  0%]  (Warmup)
## Chain 2: Iteration: 1000 / 10000 [ 10%]  (Warmup)
## (Sampling progress for Chains 2-4 truncated: each chain ran 5000 warmup and
## 5000 sampling iterations for the model 'continuous', completing in roughly
## 9-11 seconds per chain.)
## Warning: There were 13 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: Examine the pairs() plot to diagnose sampling problems
mcmc_trace(model_1)

mcmc_dens_overlay(model_1)

From the trace plots, the region-level intercepts do not differ much from one another. The Midwest and the Northeast may have slightly higher intercepts than the West, but the posterior for nearly every region is centered around 0. We would prefer to model state-level intercepts to see whether they differ across states, but unfortunately that code did not run.
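A state-level version of the model could be specified by swapping the grouping variable. This is only a sketch: `covid_data` and the outcome `death_rate` are placeholder names for whatever data frame and response `model_1` was actually fit on, and would need to be replaced.

```r
library(rstanarm)

# Hypothetical refit with state-level rather than region-level intercepts;
# replace covid_data and death_rate with the actual data frame and outcome.
model_states <- stan_glmer(
  death_rate ~ (1 | State),
  data = covid_data,
  family = gaussian,
  chains = 4, iter = 10000, seed = 84735
)
```

With 50 states instead of 5 regions there are far more group-level intercepts to estimate, so this fit would likely take noticeably longer than the regional model.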

The chains appear stable, which suggests the model might converge even with fewer iterations per chain.
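Given the 13 divergent transitions reported in the warning above, one low-effort next step (assuming `model_1` is an rstanarm fit) would be to refit with a higher `adapt_delta`, which forces the sampler to take smaller steps:

```r
# Refit the same model with a stricter acceptance target; rstanarm's
# update() method passes adapt_delta through to the underlying sampler.
model_1_tuned <- update(model_1, adapt_delta = 0.99)
```

A higher `adapt_delta` slows sampling somewhat but often eliminates a small number of divergences like the 13 seen here.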

# Store the chains (melt comes from reshape2; load it with library(reshape2))
model_1_df <- as.array(model_1) %>% 
  melt %>% 
  pivot_wider(names_from = parameters, values_from = value)

# Wrangle the chains
# Note the need for backticks around `Sigma[Region:(Intercept),(Intercept)]`.
# sigma is the residual standard deviation, so we square it to put it on the
# same (variance) scale as the between-region variance.
model_1_df <- model_1_df %>% 
  mutate(sigma_sq_w = sigma^2, sigma_sq_b = `Sigma[Region:(Intercept),(Intercept)]`) %>% 
  mutate(correlation = sigma_sq_b / (sigma_sq_w + sigma_sq_b))

ggplot(model_1_df, aes(x = correlation)) + 
  geom_density(alpha = 0.5)

The density plot of the correlation is concentrated near zero: very little of the total variability is attributable to differences across regions, so region is probably not a useful way to segment the data.

head(data.frame(summary(model_1)),-2)
##                                              mean        mcse         sd
## (Intercept)                           36.95340292 0.019676688  1.4371318
## b[(Intercept) Region:Midwest]          0.77427231 0.021061470  1.6126594
## b[(Intercept) Region:Mountain]         0.05328873 0.019625914  1.6513180
## b[(Intercept) Region:Northeast]        1.34912934 0.022448305  1.7568004
## b[(Intercept) Region:South]           -0.22563184 0.018683640  1.5661028
## b[(Intercept) Region:West]            -1.79649892 0.025021118  2.0899731
## sigma                                 18.91985542 0.004000357  0.4841495
## Sigma[Region:(Intercept),(Intercept)]  8.93404308 0.347485793 21.4982015
##                                             X10.          X50.       X90.
## (Intercept)                           35.3063220 37.0303694035 38.5142593
## b[(Intercept) Region:Midwest]         -0.8645058  0.5120844666  2.7741190
## b[(Intercept) Region:Mountain]        -1.8114164  0.0005233039  1.9885548
## b[(Intercept) Region:Northeast]       -0.3537904  1.0316081519  3.6119103
## b[(Intercept) Region:South]           -2.0501628 -0.1533411824  1.5110099
## b[(Intercept) Region:West]            -4.6517569 -1.3640026881  0.2458575
## sigma                                 18.3031300 18.9050957874 19.5429719
## Sigma[Region:(Intercept),(Intercept)]  0.1820068  3.4244641438 20.3000808
##                                       n_eff      Rhat
## (Intercept)                            5334 1.0000963
## b[(Intercept) Region:Midwest]          5863 1.0001977
## b[(Intercept) Region:Mountain]         7079 0.9999822
## b[(Intercept) Region:Northeast]        6125 1.0002203
## b[(Intercept) Region:South]            7026 0.9999207
## b[(Intercept) Region:West]             6977 1.0000634
## sigma                                 14647 0.9999579
## Sigma[Region:(Intercept),(Intercept)]  3828 1.0005962
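As a quick sanity check on the density plot, plugging the posterior means from the table above into the variance-decomposition formula (squaring `sigma`, since it is reported as a standard deviation) gives a very small intraclass correlation:

```r
sigma_b2 <- 8.934   # posterior mean of Sigma[Region:(Intercept),(Intercept)]
sigma_w  <- 18.920  # posterior mean of sigma (residual standard deviation)

sigma_b2 / (sigma_b2 + sigma_w^2)
# roughly 0.024: only about 2% of the total variability is between regions
```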

Next Steps

The obvious improvement we need to make is exploring more models. Our visualizations show that there is a wide variety of relationships we can dig into with our dataset. However, as a group, we need to decide which path(s) to take. Once we decide which relationship(s) we want to analyze further, the project can be enhanced with animated visualizations or even a Shiny app.

Project Contributions

  • Quinn: Cleaned the COVID-19 data set, provided all visualizations and descriptions of COVID-19, provided the data description, ran analysis of the model, cleaned/formatted the checkpoint 3 document, and knitted/turned in the final checkpoint 3 product.
  • Sofia: Cleaned the demographic data set, provided all visualizations and descriptions of demographic aspects, provided the data descriptions for the dataset and created variables of interest table.
  • Will: Cleaned the Google Trends data set, provided all visualizations and descriptions of the "China Virus" and "Kung Flu" search-term relationships, and provided the data descriptions for the dataset.
  • Federico: Merged the datasets and provided brief descriptions of our building model section.